
Add immutable LIRS Cache implementation #155

Merged: 22 commits into twitter:develop from jcoveney's jco/lirscache branch, Dec 16, 2013

Conversation

@jcoveney (Contributor):

This fixes #97

new Stack[K](maxSize, backingIndexMap, backingKeyMap)
}

sealed class Stack[K](maxSize:Int, backingIndexMap:SortedMap[Int, K], backingKeyMap:Map[K, Int]) {
Contributor:

Minor, but did you mean to say final here instead of sealed? I don't see any subclasses of Stack in this file. I believe it's the same case for LIRSStacks.

Contributor (author):

I misunderstood what sealed does. What's the preferred way to make a class private? Just make it final? Make the constructor private?

Contributor:

Sealed is usually used when you have a fixed number of implementations, and they must be defined within the same source file. In this case maybe you'd want to just say private[storehaus]. I'm not sure what the other caches do, but at least you're signaling that it's private to the storehaus package and isn't meant for public consumption.
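
For example, a minimal sketch of what that signaling looks like (class and package names taken from this PR, body elided):

package com.twitter.storehaus.cache

// visible anywhere under com.twitter.storehaus, hidden from external users
private[storehaus] class Stack[K](maxSize: Int)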

Contributor (author):

Oscar didn't love all the privates. He suggested tossing it all in a private class if I really care, otherwise just leaving it. What do you think is the best way? I just don't want people to depend on internals.

Contributor:

Actually, it's funny that you mention that. Today I saw a pull request come in for Scala's new pickling support that attempted to make classes private that were not intended for external consumption. The initial approach was to mark things as "private[pickling]", but someone commented that you can actually still get access to them if you do something like "import pickling._" (there's some esoteric bug that was linked to). So instead he suggested putting these private classes in an "internal" package, like this: https://github.com/scala/async/tree/master/src/main/scala/scala/async/internal


@ryanlecompte (Contributor):

Hey @jcoveney, so I took a look at the CyclicIncrementer. I think it's a cool implementation, although it definitely adds some complexity to the code. I took some time to step back and figure out exactly what we want here: essentially a bounded stack backed by an efficient array-based data structure as well as a map. I took a stab at implementing a BoundedStack that doesn't rely on ever-increasing integers. It's backed by a Vector and a Map, and is fully immutable. I performed a few tests in the REPL and it seems to do the trick, and it implements the same API as your existing Stack. Do you think this could satisfy our needs? It might be a nice general utility to add to storehaus, and perhaps we could also use it for LRUCache. If you like it, I can submit a pull request that includes it:

object BoundedStack {
  def apply[A](maxSize: Int): BoundedStack[A] =
    new BoundedStack[A](maxSize, Vector.empty[A], Map.empty[A, Int])
}

class BoundedStack[A](
  maxSize: Int,
  backingStack: Vector[A] = Vector.empty,
  backingIndexMap: Map[A, Int] = Map.empty) {

  // moves elem to the top, returning any evicted element and the new stack
  def putOnTop(elem: A): (Option[A], BoundedStack[A]) = {
    backingIndexMap.get(elem) match {
      case Some(idx) =>
        val newBackingStack = dropFromStack(idx) :+ elem
        val newIdx = newBackingStack.size - 1
        val newBackingIndexMap = backingIndexMap + (elem -> newIdx)
        (None, new BoundedStack(maxSize, newBackingStack, newBackingIndexMap))
      case None =>
        if (isFull) {
          val oldElem = backingStack.head
          val newBackingStack = dropFromStack(0) :+ elem
          val newIdx = newBackingStack.size - 1
          val newBackingIndexMap = backingIndexMap - oldElem + (elem -> newIdx)
          (Some(oldElem), new BoundedStack(maxSize, newBackingStack, newBackingIndexMap))
        } else {
          val newBackingStack = backingStack :+ elem
          val newIdx = newBackingStack.size - 1
          val newBackingIndexMap = backingIndexMap + (elem -> newIdx)
          (None, new BoundedStack(maxSize, newBackingStack, newBackingIndexMap))
        }
    }
  }

  def dropOldest: (Option[A], BoundedStack[A]) = {
    backingStack.headOption match {
      case e @ Some(elem) =>
        val newBackingStack = dropFromStack(0)
        val newBackingIndexMap = backingIndexMap - elem
        (e, new BoundedStack(maxSize, newBackingStack, newBackingIndexMap))
      case None =>
        (None, this)
    }
  }

  def remove(elem: A): (Option[A], BoundedStack[A]) = {
    backingIndexMap.get(elem) match {
      case Some(idx) =>
        val newBackingStack = dropFromStack(idx)
        val newBackingIndexMap = backingIndexMap - elem
        (Some(elem), new BoundedStack(maxSize, newBackingStack, newBackingIndexMap))
      case None =>
        (None, this)
    }
  }

  def contains(elem: A): Boolean = backingIndexMap.contains(elem)

  def isOldest(elem: A): Boolean = backingStack.headOption.exists { _ == elem }

  def size: Int = backingStack.size

  def isFull: Boolean = size >= maxSize

  def isEmpty: Boolean = size == 0

  def empty: BoundedStack[A] = new BoundedStack[A](maxSize, Vector.empty, Map.empty)

  override def toString = backingStack.mkString(",")

  // drops the element at idx; note that patch shifts every later element
  // left by one, so those elements' entries in backingIndexMap go stale
  private def dropFromStack(idx: Int): Vector[A] = backingStack.patch(idx, Seq.empty, 1)
}
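
For illustration, a quick REPL-style sketch of the API above (assuming the code as given):

val (_, s1) = BoundedStack[String](2).putOnTop("a")
val (_, s2) = s1.putOnTop("b")
val (evicted, s3) = s2.putOnTop("c") // full: the oldest element, "a", falls off
// evicted == Some("a"); s3.toString == "b,c"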

@ryanlecompte (Contributor):

Also, Vector is effectively constant-time for appends: http://docs.scala-lang.org/overviews/collections/performance-characteristics.html

@jcoveney (Contributor, author):

Hey, I really like this idea. I actually originally implemented it with a Vector as well, as they are a sweet, underused data type. The one thing I'm not sure about is how patch is implemented. It seems like vectors are optimized for efficient operations on the ends, but I'm not sure how they do random removals. If that is efficient, then it is definitely a great data structure to use. I just need to look at the implementation of patch. Do you know if they can drop things efficiently? Because if not, that is the big loss here.

@ryanlecompte (Contributor):

So, I did a bit of digging into this and came across this thread: https://groups.google.com/forum/#!topic/scala-user/fZ1TTNgneW4

It looks like some people suggest using patch, while others came up with less visually appealing solutions that may perform better. I ran a quick benchmark using Thyme on a Vector with 100K elements: removing an item at an arbitrary index using patch is painfully slow, because it seems to end up reshuffling/rebuilding the Vector from scratch (unlike Vector#updated, which is very fast but only allows replacing an element, not removing it).

However, I wonder if we are just being overly paranoid in general? I think we could probably safely just convert your original implementation to use a Long instead of an Int. Let's think about it:

Let's assume that a LIRSCache gets accessed roughly 100K times a second (very aggressive for a single machine), and that each access increments the index. If the cache consistently gets hit 100K times a second for an entire year, that's 100K * 31,536,000 seconds in a year = 3,153,600,000,000 increments. Long.MaxValue is 9,223,372,036,854,775,807. So, at that aggressive rate of 100K hits/s, you would end up with Long.MaxValue / 3,153,600,000,000 ≈ 2,924,712 years before it overflows!
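
The same arithmetic as a quick REPL sketch:

// back-of-the-envelope overflow horizon for a Long access counter
val hitsPerSecond   = 100000L
val secondsPerYear  = 31536000L                       // 365 * 24 * 60 * 60
val hitsPerYear     = hitsPerSecond * secondsPerYear  // 3153600000000
val yearsToOverflow = Long.MaxValue / hitsPerYear     // 2924712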

I am leaning towards switching to Long in your original implementation and avoiding the complications added by the CyclicIncrementer. Thoughts?

@johnynek (Collaborator):

How about we split the difference:

trait IdProvider[T] {
  def next: (T, IdProvider[T])
  def cull(t: T): IdProvider[T]
}

case class LongProvider(current: Long) extends IdProvider[Long] {
  def next = (current, LongProvider(current + 1L))
  def cull(l: Long) = this // no-op: Longs are never recycled
}

object IdProvider {
  implicit def longCounter: IdProvider[Long] = LongProvider(0L)
}

Then implement this typeclass (which you have already done) for CyclicIncrement[T] so we have:

object CyclicIncrement {
  // wrap your CyclicIncrementProvider (add whatever implicit evidence is
  // needed here, or just fix T = Int)
  implicit def idProvider[T: Ordering: Successible]: IdProvider[CyclicIncrement[T]] = ???
}

Lastly, Stack can be made to use a generic provider with the typeclass pattern: Ryan can inject the LongProvider by using Id == Long, and you can use the cyclic version with Id == CyclicIncrement[Int].
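
A rough sketch of that shape, reusing the IdProvider trait above (hypothetical signature; parameter names are illustrative):

import scala.collection.SortedMap

// the id type is abstract: LongProvider plugs in Id = Long, and the cyclic
// provider plugs in Id = CyclicIncrement[Int]
class Stack[K, Id: Ordering](maxSize: Int,
                             idProvider: IdProvider[Id],
                             backingIndexMap: SortedMap[Id, K],
                             backingKeyMap: Map[K, Id])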

@ryanlecompte (Contributor):

That's a really nice way to split the difference and a sweet use of the typeclass pattern. I really wish Scala had a scala.collection.immutable.LinkedHashMap. Unfortunately there's just a mutable one, but that's pretty much exactly what we want here (i.e., a bounded LinkedHashMap). Here's a quick version of the stack using a mutable LinkedHashMap. It certainly simplifies things, but it would of course require synchronization, and it only really makes sense for a mutable version of a LIRS cache. It would actually be interesting to see the performance difference between a thread-safe mutable LIRS cache and the immutable one:

import scala.collection.mutable

object BoundedStack {
  def apply[A](maxSize: Int): BoundedStack[A] = new BoundedStack[A](maxSize)
}

class BoundedStack[A](maxSize: Int) {
  private val backingMap = mutable.LinkedHashMap.empty[A, A]

  // moves elem to the top (back of the insertion order), returning any
  // evicted element
  def putOnTop(elem: A): Option[A] = {
    backingMap.get(elem) match {
      case Some(_) =>
        backingMap -= elem
        insert(elem)
        None
      case None =>
        if (isFull) {
          val oldElem = dropOldest
          insert(elem)
          oldElem
        } else {
          insert(elem)
          None
        }
    }
  }

  def dropOldest: Option[A] = {
    backingMap.headOption match {
      case Some((elem, _)) =>
        backingMap -= elem
        Some(elem)
      case None =>
        None
    }
  }

  def remove(elem: A): Option[A] = {
    backingMap.get(elem) match {
      case s @ Some(_) =>
        backingMap -= elem
        s
      case None =>
        None
    }
  }

  def contains(elem: A): Boolean = backingMap.contains(elem)

  def isOldest(elem: A): Boolean = backingMap.keys.headOption.exists { _ == elem }

  def size: Int = backingMap.size

  def isFull: Boolean = size >= maxSize

  def isEmpty: Boolean = size == 0

  def empty: BoundedStack[A] = new BoundedStack[A](maxSize)

  override def toString = backingMap.keys.mkString(",")

  // remove-then-insert (see putOnTop) is what moves an existing key to the
  // back; a plain update would keep the key's original insertion position
  private def insert(elem: A) { backingMap += elem -> elem }
}

@jcoveney (Contributor, author):

Another data structure that could work is the one that clojure uses, the priority map: https://github.com/clojure/data.priority-map

@jcoveney (Contributor, author):

As far as my thoughts: I agree that going with the trait is the way to go; good suggestion, Oscar. That said, I don't think we should go with the Vector approach. And yeah, I'd be curious to compare the immutable implementation with the mutable one...though the wins from making it immutable, of course, transcend performance concerns.

@ryanlecompte (Contributor):

Indeed. :) Agree with everything!

@johnynek (Collaborator):

Looks like we are converging here. Nice. One thought: rather than IdProvider[T], what about Clock[T]? Basically, it gives you a now and a tick to go to the next "now". MutableClock[T] would be one where tick returns Unit and now returns the latest time (like a wrapper around System.currentTimeMillis). Lastly, note that Snowflake, Twitter's unique ID generator, is a kind of MutableClock.

Just an idea. Maybe IdProvider is more to the point, but I liked the analogy to a kind of abstract Time.
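
A minimal sketch of that idea (hypothetical; the two-parameter shape matches how IdProvider ends up extending Clock in the diff below):

// tick hands back the current "time" together with the advanced clock; an
// immutable counter and a System.currentTimeMillis wrapper can both fit
trait Clock[@specialized(Int, Long) T, This <: Clock[T, This]] {
  def tick: (T, This)
}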

@jcoveney (Contributor, author) commented Nov 5, 2013:

I incorporated the clock idea, added some tests, and I think it's good to go. If someone (maybe me!) implements a priority map, we can use that in these cache implementations (and in Scala, period, hah) and it'd be really cool.

@ryanlecompte (Contributor):

Nice! I like it. Great work. Immutable data structures are so much fun. I had fun writing an immutable patricia trie (prefix map) over the weekend: https://gist.github.com/ryanlecompte/7287415
Immutability FTW!

}

trait IdProvider[@specialized(Int, Long) T] extends Clock[T, IdProvider[T]] {
  def cull(oldId: T): IdProvider[T]
Collaborator:

Comment on the meaning here. (A bit like "free" in a malloc sense, right?)

@jcoveney (Contributor, author):

Note that this cannot be included until a new Algebird is pushed, because it uses Successible.

@jcoveney (Contributor, author):

With the new Algebird release, this now passes and can be merged :)

object CyclicIncrementProvider {
  def intIncrementer: CyclicIncrementProvider[Int] = CyclicIncrementProvider(0)

  def apply[@specialized(Int, Long) K: Ordering: Successible](zero: K): CyclicIncrementProvider[K] =
Collaborator:

Doesn't Successible have an ordering? You could weaken this to just Successible and get the ordering from there, right?
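
A sketch of that weakening, assuming Successible exposes its Ordering (the member name `ordering` is an assumption about the Algebird API):

def apply[@specialized(Int, Long) K: Successible](zero: K): CyclicIncrementProvider[K] = {
  // recover the Ordering from the Successible evidence instead of asking
  // for a second context bound (assumes Successible[K] has `ordering`)
  implicit val ord: Ordering[K] = implicitly[Successible[K]].ordering
  ??? // construct the provider exactly as before
}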

johnynek added a commit referencing this pull request on Dec 16, 2013: "Add immutable LIRS Cache implementation"

@johnynek merged commit fa32ddb into twitter:develop on Dec 16, 2013
@johnynek (Collaborator):

By the way, it would be nice to have some benchmarking code for cache hit-rate under a variety of distributions on the key space (e.g. probability of seeing key i is ~ 1/i for N keys).
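
A hypothetical sketch of such a benchmark; the (contains, touch) parameters stand in for whatever the real cache API exposes, and all names are illustrative:

import scala.util.Random

// draws keys with P(key = i) proportional to 1/i over n keys (inverse-CDF
// sampling) and folds them through a cache, counting lookup hits
def hitRate[C](empty: C,
               contains: (C, Int) => Boolean,
               touch: (C, Int) => C,
               n: Int, draws: Int, seed: Long = 42L): Double = {
  val rng = new Random(seed)
  val hN = (1 to n).map(1.0 / _).sum  // normalizing constant H_n
  def zipfKey(): Int = {
    var u = rng.nextDouble() * hN
    var i = 1
    while (i < n && u > 1.0 / i) { u -= 1.0 / i; i += 1 }
    i
  }
  var cache = empty
  var hits = 0
  for (_ <- 1 to draws) {
    val k = zipfKey()
    if (contains(cache, k)) hits += 1
    cache = touch(cache, k)
  }
  hits.toDouble / draws
}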

@jcoveney deleted the jco/lirscache branch July 15, 2014 16:58